DS 5110: Big Data Systems | Final Project

State of Virginia Traffic Reliability | MAP21

by Christian Schroeder (dbn5eu), Timothy Tyree (twt6xy), Colin Warner (ynq9ya)


Brief Overview

Project Description: This study used Virginia-specific data for a set of independent variables (Year, Period, Precipitation Rate, Volume/Capacity Ratio, Hourly Volume, Presence of Safety Service Protocol, Crashes, Weather Events, Number of Lanes, County, Terrain, Urban Designation, Road Direction, Intersection, Segment Order, and Truck Percentage) to predict whether a MAP-21 reporting segment was reliable. The data consists of the metrics listed above for each highway segment in Virginia, for each year between 2017 and 2020. Furthermore, VDOT forecasted these metrics out to 2024. Our goal is to use the actual data up to 2020 to find an accurate model using train, test, and validation splits. If such a model is found, we can use the forecasted metrics to classify future unreliable segments.

What is MAP-21? "MAP-21, the Moving Ahead for Progress in the 21st Century Act (P.L. 112-141), was signed into law by President Obama on July 6, 2012. Funding surface transportation programs at over \$105 billion for fiscal years (FY) 2013 and 2014, MAP-21 is the first long-term highway authorization enacted since 2005. MAP-21 is a milestone for the U.S. economy and the Nation’s surface transportation program. By transforming the policy and programmatic framework for investments to guide the system’s growth and development, MAP-21 creates a streamlined and performance-based surface transportation program and builds on many of the highway, transit, bike, and pedestrian programs and policies established in 1991." Source: https://www.fhwa.dot.gov/map21/

Notebook Description: In this Notebook, we will walk through (1) importing our data from the Virginia Department of Transportation, and (2) the preprocessing necessary to ensure our data is in a format suitable for splitting, exploratory analysis, and modeling. To avoid repetitive code (reading, joining, etc.), we wrote a custom preprocessing class. The following section imports that class and explains how it works.


The Data

After formatting and combining the input data, the final trainable dataset will consist of the following columns:

Response

Predictors


Helper Classes

The classes below are also saved as .py files and imported for use.

Preprocessor Class

Visualizer Class

Mapper Class

Import Packages, Initialize Spark Session, Read, Combine, and Transform Data to a Workable Format

Below we call the readAndCombineData() function from the Preprocessor class to perform the following tasks:

  1. Create a dictionary of directories with the directory name as key (ex. TMC/) and empty lists as values. This will hold dataframes that can be joined on shared unique identifiers.
  2. Get the full path to the input directories, then use a formatted string to build the path to each remaining directory in a loop (iterating over the directory-name keys).
  3. Create a nested list of lists defining the type of join each directory will perform, ordered the same as the directories.
  4. Join all data as follows:
    • a) Loop through each directory (outer loop).
    • b) Loop through the files in each directory and read each file into a Spark dataframe (inner loop).
    • c) Append the dataframe to the values list of its respective directory key.
    • d) After the inner loop, pop the last dataframe out of the list and save it to a temporary variable. This dataframe starts the joins on each directory's respective join identifiers.
    • e) Join the dataframes within each directory into one, reducing the original 12 dataframes to 3. Within another inner loop: start with the popped dataframe, join it with the current iteration's dataframe on the columns specified at the current index of the join list, then carry the joined result forward into the next iteration. This joins every dataframe in the list on its respective identifiers without repeating or missing any.
    • f) Outside the previous loop, append the carried dataframe (now the fully joined one) to the end of the list, and drop every other dataframe in the list.
    • g) As a sanity check, loop through all columns in the joined dataframe and drop any duplicates.
  5. Sequentially join the final three dataframes into one, making sure to join on the dataframe with the most prior identifiers so there is no data loss.
  6. Create trainable and forecasted datasets by filtering on year (trainable < 2021, forecasted > 2020).
  7. Save the data.

Throughout the process, markdown-formatted print statements are output to aid in debugging and to show what the function is currently doing.

Exploratory Data Analysis

Visualize data to view distributions and understand how certain variables contribute to highway segment reliability. We use Python in this section because Spark has limited visualization support and we are working in Jupyter Notebooks (as opposed to Apache Zeppelin, which would allow us to continue using Spark). The following cell loads the trainable data into a pandas dataframe and instantiates the Visualizer helper class we made.

What is the distribution of the numerical variables?
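As a rough sketch of how these distributions can be generated, pandas can histogram every numeric column in one call; the column names and values below are invented stand-ins for the trainable data, not the actual VDOT columns:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs outside a notebook too
import pandas as pd

# Hypothetical stand-in for the trainable data loaded into pandas.
df = pd.DataFrame({
    "AVG_HOURLY_VAL": [120, 340, 560, 980, 150, 410],
    "PCT_TRUCKS": [0.05, 0.12, 0.31, 0.08, 0.22, 0.17],
})

# pandas delegates to matplotlib: one histogram panel per numeric column.
axes = df.hist(bins=5, figsize=(8, 3))
n_panels = axes.size
```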

Exploring how categorical variables affect reliability

Map of Highway Segment Reliability

A mapper helper class was created to help visualize the spatial relationships of the data. The createMapViews() function creates a folium map instance for each combination of year, period, and column.

View The Maps

Note: The viewMaps() function utilizes ipywidgets and is only visible within a running notebook.

Reliability during AMP in 2017

Most of the unreliable segments are in Norfolk/Virginia Beach, Richmond, and Northern Virginia. This isn't surprising, considering those three regions make up the most populous areas in Virginia. What is surprising is that there is not a single unreliable segment on I-81. I-81 is one of the nation's largest trucking corridors, and its traffic can oftentimes be unpredictable. Perhaps District and Road should be taken out of the analysis, so that we only use numeric and non-geographically-defining categorical variables.


Perform Transformations

Below we consider whether or not to remove rows for Districts where there are no Unreliable instances.

After removing rows where the District is equal to Staunton, Culpeper, or Salem, we are left with 20,352 rows. Since our trainable data set is not incredibly large to begin with, we instead decided to move ahead with all 27,312 rows and drop the District and Road variables. This means we will continue with only numeric and non-geographically-defining categorical variables.

We decided on the schema of the final trainable data and implemented it by log transforming the ALL_WEATHER, AVG_HOURLY_VAL, and PCT-PRECIP-MINS columns, and dropping the District and Road columns.

Building Pipeline

Building a pipeline allowed us to easily tweak and rerun the preprocessing steps, as well as reduce clutter in the code. The pipeline uses a StringIndexer and OneHotEncoder on the categorical variables, and a VectorAssembler to create the 'features' column.

Split Data into Train and Test Sets

We randomly split the data into 90% training and 10% testing sets. This split was chosen because of the relatively small number of observations.

Model Construction

All models were run using 10-fold cross-validation with the classification threshold set at 0.5. Additional thresholds were then tested after determining the optimal tuning parameters.

Single Logistic Regression Model

We started with a single logistic regression model to act as a loose baseline of predictability.

Logistic Regression with CV and Tuning

We implemented cross-validation and parameter tuning to better fit the logistic regression model.

Random Forest

The second model we trained was a random forest. At each tree split, a random subset of the predictors is chosen. This method gives weaker predictors more of a chance to influence the model.

Decision Tree

After seeing the performance of the random forest model, we wondered how a simple decision tree model would perform on the same data. It was possible that certain predictors played a bigger role in whether segments were unreliable, and the random forest's feature subsetting was suppressing that influence.

Because the DecisionTreeClassificationModel does not provide a summary attribute, our plotROC method would not work. We found an alternative way to plot the curve for this model online (source: https://newbedev.com/pyspark-extract-roc-curve)

Model Evaluation

When evaluating the performance of the chosen models on the test data, we looked at the counts of true positive, true negative, false positive, and false negative predictions, as well as the accuracy and AUROC. The metrics we found the most important were accuracy, AUROC, and the false positive count. The ROC curves were also plotted and evaluated above. Although the decision tree model did not have the lowest false positive rate, it did have noticeably better accuracy and AUROC values. We decided to move forward with the decision tree model due to its ability to perform well in accuracy and AUROC while maintaining a reasonable false positive rate.

Apply 'Best Model' to Forecasted Data

We previously held out forecasted data from the Virginia Department of Transportation for the years 2021-2024. We will now apply the Decision Tree model that was found to have the highest accuracy and AUROC to this forecasted data set. The goal here is for our model to flag future unreliable highway segments for the state of Virginia so that they can direct their attention to programs that will improve the reliability of these segments.

We applied the same transformations to the forecasted data as the trainable data.

Run Forecasted Data Through Pipeline

We first must run the forecasted data through a pipeline of StringIndexer, OneHotEncoder, and Vector Assembler in the same way that we did our historical, trainable data set.

Run formatted data through model, print probability and prediction

Conclusion

The primary goal of our project was to identify future unreliable highway segments for VDOT. According to the state, a segment only needs to be classified as unreliable in one of the morning, midday, evening, or weekend periods to be considered unreliable. Our decision tree model identified an average of 136 segments projected to be unreliable each year from 2021-2024. We believe these segments should be prioritized when allocating MAP-21 funds to highway improvements.